Summary

Principal component analysis is a useful dimension reduction and data visualization method. 
However, in high dimension, low sample size asymptotic contexts, where the sample size is 
fixed and the dimension goes to infinity, a paradox has arisen. In particular, despite the useful 
real data insights commonly obtained from principal component score visualization, these scores 
are not consistent even when the sample eigenvectors are consistent. This paradox is resolved by 
asymptotic study of the ratio between the sample and population principal component scores. In 
particular, it is seen that this proportion converges to a non-degenerate random variable. The re- 
alization is the same for each data point, i.e. there is a common random rescaling, which appears 
for each eigen-direction. This then gives inconsistent axis labels for the standard scores plot, yet 
the relative positions of the points (typically the main visual content) are consistent. This paradox 
disappears when the sample size goes to infinity. 
1. Introduction

Visualization of high dimension, low sample size data by principal component analysis has 
proven to be very useful. A recent example is shown in Figure 1, which studies Next Generation
Sequencing for a single gene, in a cancer study, from The Cancer Genome Atlas (TCGA, 2005). 
The data objects are n = 180 curves (each from one biological tissue sample), reflecting the 
log base 10 read depth, at around d = 1700 genome map locations. Relative positions of these 
biological samples are visualized, using a standard principal components scores plot, in Panel 
(A) of Figure 1. The plot shows the projection of the data onto the subspace generated by the 
first two eigenvectors. Note that there is a distinct visual impression of three clusters. To investigate
this clustering, the clusters have been manually brushed, with three different colors, as shown. To 
investigate whether these clusters represent important scientific phenomena, the same coloring 
is applied to the raw data curves in Panel (B). The distinct blocks in Panel (B) represent different 
exonic regions of the genome, and the jumps in the curves at the boundaries of these blocks 
indicate splicing events. The red curves are generally very low (recall the log scale) indicating 
very low levels of expression of this gene, for these samples. The black and blue curves are 
generally much higher, showing much stronger gene expression. The black and blue curves differ
strongly over the fourth exonic region, between 1000 and 1400, where the blue samples show
a clear deletion of this exon. Finding such deletions is an important goal in cancer research,
as they can form the basis of targeted treatments. Note that this is just one example, where
scientifically important structure in data has been discovered by principal component analysis in
a high dimension, low sample size context.

Fig. 1. Principal component analysis of a Next Generation Sequencing cancer data set. The scores
plot in Panel (A) suggests three clusters, which are manually brushed with colors. Relevance of
these visually discovered clusters is studied in Panel (B), by showing curves with the brushed
colors, which reveals a biologically important alternate splicing event.

We are interested in investigating the mathematical underpinnings of this visual approach to 
data analysis demonstrated in Figure 1. There are several approaches to this in the literature. 
Given the nature of genetic data, we prefer to study high dimension, low sample size asymptotics.
This approach considers increasing dimension $d \to \infty$ for a fixed sample size $n$. It has
recently been studied in various multivariate analysis contexts, including geometric represen- 
tation of high dimensional data (Hall et al., 2005), clustering (Ahn et al., 2012), and principal 
component analysis (Ahn et al., 2007; Jung & Marron, 2009; Jung et al., 2012; Yata & Aoshima, 
2012; Shen et al., 2012a). However, these asymptotic analyses of principal component analysis 
all focused on studying the angles between the sample eigenvectors and the corresponding pop- 
ulation eigenvectors. For example, under some mild conditions, Jung & Marron (2009) showed 
that such angles go to 0, which is defined as the consistency of the sample eigenvectors. 

In this paper, we take a deeper look by studying principal component scores, shown as the 
circles in Panel (A) of Figure 1. Our analysis surprisingly reveals an apparent paradox under
the high dimension low sample size setting: principal component scores are inconsistent with 
the corresponding population scores, even when the sample eigenvectors are consistent. Further- 
more, for a fixed n and a particular principal component, as d goes to infinity, the proportion 
between the sample scores and the corresponding population scores converges to a random vari- 
able, whose realization is the same for each observation. The findings suggest that, although the 
principal component scores cannot be consistently estimated, the scores scatter plots, such as
Panel (A) of Figure 1, can still be used to explore interesting features of high dimension, low
sample size data, because the relative positions of the points are consistent due to the common 
scaling. Finally, this phenomenon disappears when the sample size tends to infinity. In particular, 
both the sample eigenvectors and the sample principal component scores are then consistent. 
2. Notations and Assumptions

Assume that $X_1, \ldots, X_n$ are a random sample from the $d$-dimensional normal distribution
$N(\xi, \Sigma)$, and the population covariance matrix $\Sigma$ has the following eigen-decomposition:

\[ \Sigma = U \Lambda U^T, \]

where $\Lambda$ is the diagonal matrix of the population eigenvalues $\lambda_1 \geq \cdots \geq \lambda_d$, and $U$ is the
corresponding eigenvector matrix such that $U = [u_1, \ldots, u_d]$. Denote the $j$th normalized population
principal component score vector as

\[ S_j = (S_{1,j}, \ldots, S_{n,j})^T = \lambda_j^{-1/2} (u_j^T X_1, \ldots, u_j^T X_n)^T, \quad j = 1, \ldots, d. \tag{1} \]

Let $\bar{X}$ be the sample mean. As discussed in Paul & Johnstone (2007),

\[ \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T \ \text{has the same distribution as} \ \sum_{i=1}^{n-1} Y_i Y_i^T, \]

where the $Y_i$ are independent and identically distributed random vectors from $N(0, \Sigma)$. It follows
that the sample covariance matrix is location invariant. Without loss of generality, we assume
that $X_1, \ldots, X_n$ are a random sample from the $d$-dimensional normal distribution $N(0, \Sigma)$.
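
The location invariance can be checked numerically. The following Python sketch is illustrative only: the sizes d and n, the shift vector, and the helper name centered_scatter are assumptions made here and are not part of the original analysis. It confirms that the centered scatter matrix is unchanged when every observation is shifted by the same vector, and that it has rank n - 1, in line with the representation as a sum of n - 1 terms.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 30, 7                                  # illustrative sizes (assumed)
X = rng.standard_normal((d, n))               # columns are observations
shift = 10.0 * rng.standard_normal((d, 1))    # arbitrary common mean shift


def centered_scatter(data):
    """Return sum_i (X_i - X_bar)(X_i - X_bar)^T for a d x n data matrix."""
    centered = data - data.mean(axis=1, keepdims=True)
    return centered @ centered.T


S0 = centered_scatter(X)
S1 = centered_scatter(X + shift)              # same data, shifted location
rank = np.linalg.matrix_rank(S0)

print("max abs difference:", np.abs(S0 - S1).max())
print("rank of centered scatter:", rank, "with n - 1 =", n - 1)
```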

Denote the data matrix as $X = [X_1, \ldots, X_n]$, and the sample covariance matrix as
$\hat\Sigma = n^{-1} X X^T$, which has the following eigen-decomposition,

\[ \hat\Sigma = \hat{U} \hat\Lambda \hat{U}^T, \tag{2} \]

where $\hat\Lambda = \mathrm{diag}(\hat\lambda_1, \ldots, \hat\lambda_d)$ is the diagonal sample eigenvalue matrix, and $\hat{U} = [\hat{u}_1, \ldots, \hat{u}_d]$
is the corresponding sample eigenvector matrix. In addition, the matrix $X/\sqrt{n}$ has the following
singular value decomposition: $X/\sqrt{n} = \sum_{j=1}^{d} \hat\lambda_j^{1/2} \hat{u}_j \hat{v}_j^T$, where $\hat{v}_j = (\hat{v}_{1,j}, \ldots, \hat{v}_{n,j})^T$,
$j = 1, \ldots, d$. Then the $j$th normalized sample principal component score vector is

\[ \hat{S}_j = (\hat{S}_{1,j}, \ldots, \hat{S}_{n,j})^T = \sqrt{n}\, (\hat{v}_{1,j}, \ldots, \hat{v}_{n,j})^T, \quad j = 1, \ldots, d. \tag{3} \]

Panel (A) of Figure 1 shows a scatter plot of the $\hat{S}_{i,1}$ versus $\hat{S}_{i,2}$, $i = 1, \ldots, n$.
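
As a minimal sketch of these definitions, the Python code below computes the normalized population and sample scores for simulated data. The function names, the sizes, the spike value and the choice of coordinate axes as population eigenvectors are all assumptions made for illustration. The sketch takes the sample scores to be normalized by the sample eigenvalue, i.e. the $i$th score is $\hat\lambda_j^{-1/2} \hat{u}_j^T X_i$, which equals $\sqrt{n}\,\hat{v}_{i,j}$ under the singular value decomposition of $X/\sqrt{n}$.

```python
import numpy as np


def population_scores(X, U, lam, j):
    """Normalized population scores lam_j^{-1/2} (u_j^T X_1, ..., u_j^T X_n)."""
    return (U[:, j] @ X) / np.sqrt(lam[j])


def sample_scores(X, j):
    """Normalized sample scores sqrt(n) * v_hat_j from the SVD of X / sqrt(n).

    Also returns the j-th sample eigenvector and sample eigenvalue.
    """
    n = X.shape[1]
    # Economy SVD: X / sqrt(n) = U_hat diag(s) Vt, and s**2 are sample eigenvalues.
    U_hat, s, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
    return np.sqrt(n) * Vt[j], U_hat[:, j], s[j] ** 2


rng = np.random.default_rng(0)
d, n = 50, 8                       # illustrative sizes (assumed)
lam = np.ones(d)
lam[0] = 400.0                     # one large population eigenvalue (assumed)
U = np.eye(d)                      # assumption: eigenvectors are coordinate axes
X = np.sqrt(lam)[:, None] * rng.standard_normal((d, n))   # columns ~ N(0, diag(lam))

S_pop = population_scores(X, U, lam, 0)
S_hat, u_hat, lam_hat = sample_scores(X, 0)

# Equivalent form of the sample scores: lam_hat^{-1/2} u_hat^T X_i.
S_hat_alt = (u_hat @ X) / np.sqrt(lam_hat)
print(np.allclose(S_hat, S_hat_alt))
```

Because $\hat{v}_j$ has unit length, the squared sample scores for each component sum to $n$ by construction, whereas the squared population scores sum to a chi-square variable.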



3. Asymptotic Properties of Principal Component Scores 

The asymptotic properties of principal component scores in high dimension, low sample size
contexts are studied in Section 3·1, and as the sample size grows in Section 3·2.

3·1. High Dimension, Low Sample Size Analysis
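In this subsection, we consider the high dimension, low sample size settings, where the sample
size $n$ is fixed and the dimension $d$ goes to infinity. We consider multiple spike models (Jung &
Marron, 2009) under which, as $d \to \infty$,

\[ \lambda_1 \gg \cdots \gg \lambda_m \gg \lambda_{m+1} \sim \cdots \sim \lambda_d \sim 1, \tag{4} \]

where $\lambda_i \gg \lambda_j$ means that $\lim_{d\to\infty} \lambda_j/\lambda_i = 0$, and $\lambda_i \sim \lambda_j$ means that
$c_1 \leq \liminf_{d\to\infty} \lambda_j/\lambda_i \leq \limsup_{d\to\infty} \lambda_j/\lambda_i \leq c_2$ for constants $c_1 \leq c_2$.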

Under the above spike models, Jung & Marron (2009) showed that when $n$ is fixed, if $d/\lambda_m \to 0$,
the angle between each of the first $m$ sample eigenvectors $\hat{u}_j$ and its corresponding population
eigenvector $u_j$ goes to 0 with probability 1, which is defined as the consistency of the sample
eigenvector.
However, under the same assumptions, an anonymous reviewer identified a paradoxical phenomenon
in that the sample principal component scores are not consistent. In addition, our analysis
suggests that, for a particular principal component, the proportion between the sample principal
component scores and the corresponding population scores converges to a random variable,
the realization of which remains the same for all data points. These results are summarized in
the following Theorem 1. The findings suggest that it remains valid to use score scatter plots as
a graphical tool to identify interesting features in high dimension, low sample size data.

Theorem 1. Under Assumption (4), and for fixed $n$, as $d \to \infty$, if $d/\lambda_m \to 0$, then the
proportion between the sample and population principal component scores satisfies

\[ \frac{\hat{S}_{i,j}}{S_{i,j}} \xrightarrow{p} R_j, \quad i = 1, \ldots, n, \ j = 1, \ldots, m, \tag{5} \]

where $\xrightarrow{p}$ stands for convergence in probability, and $R_j$ has the same distribution as $\sqrt{n/\chi_n^2}$,
with $\chi_n^2$ being the chi-square distribution with $n$ degrees of freedom.
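
The common rescaling can be illustrated by simulation. The Python sketch below is not the authors' code: it assumes a single spike ($m = 1$), coordinate axes as population eigenvectors, and illustrative values of $n$, $d$ and $\lambda_1 = d^2$, so that $d/\lambda_1$ is small. The sign of the sample eigenvector is fixed so that it has a positive inner product with the population eigenvector. The score ratios are then compared with $\sqrt{n / \sum_i z_{i,1}^2}$, a realization of a variable with the distribution stated for $R_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 5000                     # fixed small n, large d (assumed values)
lam = np.ones(d)
lam[0] = float(d) ** 2              # single spike, so d / lam_1 = 1 / d

Z = rng.standard_normal((d, n))     # row k holds z_{1,k}, ..., z_{n,k}
X = np.sqrt(lam)[:, None] * Z       # eigenvectors taken as coordinate axes (assumed)

S_pop = Z[0]                        # population scores of the first component

U_hat, s, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
sign = np.sign(U_hat[0, 0])         # orient u_hat_1 towards u_1
S_hat = sign * np.sqrt(n) * Vt[0]   # sample scores of the first component

ratios = S_hat / S_pop
R_realized = np.sqrt(n / np.sum(S_pop ** 2))

print("ratios across observations:", np.round(ratios, 4))
print("sqrt(n / sum of squared z):", np.round(R_realized, 4))
```

In runs of this kind the ratios agree closely across the $n$ observations, while their common value changes from one simulated data set to the next.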

Remark 1. Under the assumptions in Theorem 1, Jung & Marron (2009) and Shen et al.
(2012b) have shown that the angle between the sample eigenvector $\hat{u}_j$ and the corresponding
population eigenvector $u_j$, for $j = 1, \ldots, m$, converges to 0 with probability 1, which means
that the sample eigenvectors are consistent, although the principal component scores are
inconsistent under the same assumptions.

Remark 2. It follows from (5) that the ratio Rj only depends on j (the index of the principal 
components), but not i (the index of the data points). This particular scaling suggests that the 
scores scatter plot, such as Panel (A) of Figure 1, has incorrectly labeled axes (by the common
factor Rj for the corresponding axis), and yet asymptotically correct relative positions of the 
points; hence the scatter plot still enables meaningful identification of useful scientific features 
as demonstrated in Panel (B). 

3·2. Growing Sample Size Analysis

In this subsection, we consider growing sample size contexts, where $n \to \infty$, and then study
the asymptotic properties of the principal component scores. This covers a wide range of
settings, including classical asymptotics, where the dimension $d$ is fixed, random matrix asymptotics,
where $d \sim n$, and more; see Shen et al. (2012b) for an overview. Unlike the low sample size
setting, the apparent inconsistency paradox now disappears. This means that both the sample
eigenvectors and the sample principal component scores can be consistent.

We consider the following multiple spike models, as $n \to \infty$,

\[ \lambda_1 \succ \cdots \succ \lambda_m \succ \lambda_{m+1} \sim \cdots \sim \lambda_d \sim 1. \tag{6} \]

Here $\lambda_i \succ \lambda_j$ means that $\limsup_{n\to\infty} \lambda_j/\lambda_i < 1$. Compared with the multiple spike models (4), the
multiple spike assumption (6) is weaker because we have more sample information ($n \to \infty$).

Theorem 2 suggests that as $n \to \infty$, the proportion between the sample scores and the
corresponding population scores tends to 1. This connects with the above results, from the fact that the
ratio $R_j$ in (5) has the same distribution as $\sqrt{n/\chi_n^2}$, which converges
almost surely to 1 as $n \to \infty$. Thus, it is not surprising that the apparent inconsistency disappears
as the sample size grows.
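
The concentration of $\sqrt{n/\chi_n^2}$ around 1 can be seen in a short Monte Carlo sketch. The values of $n$, the number of draws and the seed are arbitrary choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
draws = 20000                        # Monte Carlo sample size (assumed)
summary = {}

for n in (5, 50, 500, 5000):
    # Draw chi-square variables with n degrees of freedom and form sqrt(n / chi2).
    R = np.sqrt(n / rng.chisquare(df=n, size=draws))
    summary[n] = (R.mean(), R.std())
    print(n, round(R.mean(), 4), round(R.std(), 4))
```

The standard deviation shrinks as $n$ grows, so the random rescaling of the score axes becomes negligible for large samples.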



Theorem 2. Under Assumption (6), and as $n \to \infty$, if $d/\lambda_m \to 0$, then the proportion
between the sample and population principal component scores satisfies

\[ \frac{\hat{S}_{i,j}}{S_{i,j}} \xrightarrow{a.s.} 1, \quad i = 1, \ldots, n, \ j = 1, \ldots, m, \tag{7} \]

where $\xrightarrow{a.s.}$ stands for almost sure convergence.

Remark 1. In the current context, the consistency of the sample principal component
scores is in line, as expected, with the fact that the sample eigenvectors are consistent under the
assumptions of Theorem 2. In particular, Shen et al. (2012b) have shown that, under the same
assumptions, the angle between the sample eigenvector $\hat{u}_j$ and the corresponding population
eigenvector $u_j$, for $j = 1, \ldots, m$, converges almost surely to 0.



Appendix 

In this section, we provide the technical details of the proofs for Theorems 1 and 2. First, we present 
two lemmas from Shen et al. (2012b), that will be used to prove Theorems 1 and 2. 

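Lemma 1. Under the assumptions in Theorem 1 and as $d \to \infty$, the sample eigenvalues satisfy

\[ \frac{\hat\lambda_j}{\lambda_j} \xrightarrow{p} \frac{\chi_n^2}{n}, \quad j = 1, \ldots, m, \tag{A1} \]

and the sample eigenvectors satisfy

\[
\begin{cases}
(\lambda_k/\lambda_j)^{1/2}\, |\hat{u}_j^T u_k| \xrightarrow{p} 1, & \text{for } 1 \leq k = j \leq m, \\
(\lambda_k/\lambda_j)^{1/2}\, |\hat{u}_j^T u_k| \xrightarrow{p} 0, & \text{for } 1 \leq k \neq j \leq m, \\
\sum_{k=m+1}^{d} (\hat{u}_j^T u_k)^2 \xrightarrow{p} 0, & \text{for } 1 \leq j \leq m.
\end{cases}
\tag{A2}
\]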

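Lemma 2. Under the assumptions in Theorem 2 and as $n \to \infty$, the sample eigenvalues satisfy

\[ \frac{\hat\lambda_j}{\lambda_j} \xrightarrow{a.s.} 1, \quad j = 1, \ldots, m, \tag{A3} \]

and the sample eigenvectors satisfy

\[
\begin{cases}
(\lambda_k/\lambda_j)^{1/2}\, |\hat{u}_j^T u_k| \xrightarrow{a.s.} 1, & \text{for } 1 \leq k = j \leq m, \\
(\lambda_k/\lambda_j)^{1/2}\, |\hat{u}_j^T u_k| \xrightarrow{a.s.} 0, & \text{for } 1 \leq k \neq j \leq m, \\
\sum_{k=m+1}^{d} (\hat{u}_j^T u_k)^2 \xrightarrow{a.s.} 0, & \text{for } 1 \leq j \leq m.
\end{cases}
\tag{A4}
\]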

Note that $X_i$ has the following decomposition

\[ X_i = \sum_{j=1}^{d} \lambda_j^{1/2} u_j z_{i,j}, \tag{A5} \]

where the $z_{i,j}$'s are independent and identically distributed standard normal random variables for
$i = 1, \ldots, n$, $j = 1, \ldots, d$. It follows from (1) and (A5) that the $j$th population principal component scores are

\[ S_j = (S_{1,j}, \ldots, S_{n,j})^T = (z_{1,j}, \ldots, z_{n,j})^T. \tag{A6} \]

From (3), the $j$th sample principal component scores are

\[ \hat{S}_j = (\hat{S}_{1,j}, \ldots, \hat{S}_{n,j})^T = \hat\lambda_j^{-1/2} (\hat{u}_j^T X_1, \ldots, \hat{u}_j^T X_n)^T. \tag{A7} \]
From (A5), (A6) and (A7), we have that the proportion between the sample principal component scores
and the corresponding population scores is

\[
\frac{\hat{S}_{i,j}}{S_{i,j}}
= \frac{\lambda_j^{1/2}}{\hat\lambda_j^{1/2}} \hat{u}_j^T u_j
+ \sum_{1 \leq k \leq m,\, k \neq j} \frac{\lambda_k^{1/2}}{\hat\lambda_j^{1/2}} \frac{z_{i,k}}{z_{i,j}} \hat{u}_j^T u_k
+ \sum_{k=m+1}^{d} \frac{\lambda_k^{1/2}}{\hat\lambda_j^{1/2}} \frac{z_{i,k}}{z_{i,j}} \hat{u}_j^T u_k.
\tag{A8}
\]
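
The decomposition of the score ratio is an exact algebraic identity, so it can be verified numerically for any data set. The Python sketch below does this for one simulated example. The eigenvalues, the random orthonormal basis and the sizes are assumptions chosen for illustration. The ratio is computed as $\hat\lambda_j^{-1/2} \hat{u}_j^T X_i / z_{i,j}$, and the right-hand side as the sum over all $k = 1, \ldots, d$ of $(\lambda_k/\hat\lambda_j)^{1/2} (z_{i,k}/z_{i,j}) \hat{u}_j^T u_k$, which merges the three groups of terms.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, j = 40, 6, 0                                   # illustrative sizes (assumed)
lam = np.sort(rng.uniform(1.0, 50.0, size=d))[::-1]  # population eigenvalues (assumed)
U, _ = np.linalg.qr(rng.standard_normal((d, d)))     # random orthonormal eigenvectors
Z = rng.standard_normal((n, d))                      # Z[i, k] plays the role of z_{i,k}

# Columns X_i = sum_k lam_k^{1/2} u_k z_{i,k}.
X = U @ (np.sqrt(lam)[:, None] * Z.T)

U_hat, s, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
u_hat_j = U_hat[:, j]
lam_hat_j = s[j] ** 2

# Left-hand side: sample score divided by population score, for each i.
lhs = (u_hat_j @ X) / np.sqrt(lam_hat_j) / Z[:, j]

# Right-hand side: sum over k of (lam_k / lam_hat_j)^{1/2} (z_ik / z_ij) u_hat_j^T u_k.
proj = u_hat_j @ U                                   # inner products with each u_k
terms = np.sqrt(lam / lam_hat_j) * (Z / Z[:, [j]]) * proj
rhs = terms.sum(axis=1)

print(np.allclose(lhs, rhs))
```

The column of `terms` with $k = j$ is the leading term $(\lambda_j/\hat\lambda_j)^{1/2} \hat{u}_j^T u_j$, which does not depend on $i$; the remaining columns carry the dependence on the observation index.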

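Proof of Theorem 1. It follows from Lemma 1 that as $d \to \infty$

\[
\frac{\lambda_j^{1/2}}{\hat\lambda_j^{1/2}} |\hat{u}_j^T u_j| \xrightarrow{p} R_j, \qquad
\sum_{1 \leq k \leq m,\, k \neq j} \frac{\lambda_k^{1/2}}{\hat\lambda_j^{1/2}} \frac{z_{i,k}}{z_{i,j}} \hat{u}_j^T u_k \xrightarrow{p} 0,
\tag{A9}
\]

where $R_j$ has the same distribution as $\sqrt{n/\chi_n^2}$. Without loss of generality, we assume that $\lambda_k = 1$ for
$k = m + 1, \ldots, d$. Then it follows from the Cauchy-Schwarz inequality that

\[
\left| \sum_{k=m+1}^{d} \frac{\lambda_k^{1/2}}{\hat\lambda_j^{1/2}} \frac{z_{i,k}}{z_{i,j}} \hat{u}_j^T u_k \right|
\leq \frac{1}{\hat\lambda_j^{1/2} |z_{i,j}|}
\left( \sum_{k=m+1}^{d} z_{i,k}^2 \right)^{1/2}
\left( \sum_{k=m+1}^{d} (\hat{u}_j^T u_k)^2 \right)^{1/2}.
\tag{A10}
\]

From Lemma 1 and (A10), the last term on the right-hand side of Equation (A8) converges to 0 with
probability 1. Combining the above with (A9), we obtain (5). In addition, it follows from (A2) that the
angle between $\hat{u}_j$ and $u_j$ converges to 0 with probability 1, which concludes the proof of Theorem 1.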

Proof of Theorem 2. The proof of Theorem 2 is similar. To avoid overlap, details are not given here. The
critical difference in the proof of Theorem 2 is the use of Lemma 2, i.e. (A1) should be replaced by (A3).